oDCM - Opening Lecture

Hannes Datta


Welcome to oDCM!

We're about to start with the first lecture of this class.

If you haven't done so, please

Agenda

  • Part 1 (14.45 to about 15.45)
    • Getting to know each other
    • Motivation for the course
    • Course framework and learning goals
    • Agenda and practical arrangements
  • Break
  • Part 2: Python Bootcamp on your laptops (about 16.00 - 17.30)

Disclaimer

  • This is not a class that merely teaches you Python (but, if you invest time, you'll learn Python on the way!)
  • You can also extract web data using other software packages (e.g., R)
  • Mix of students at various levels (e.g., beginners, advanced Python users)
  • Course mostly takes place on campus (based on student feedback); attendance is not mandatory but strongly encouraged
  • I will not record any online or offline sessions; slides will always be posted
  • Consider me your coach, not your distant professor
  • Slow me down if you need to

About myself

  • scraping nerd — learned it in 2008 using Visual Basic in Excel
  • started doing my own research with scraped and API-extracted data in 2012 (so, 10+ years of experience)
  • left Germany around your age, now many years in NL
  • Associate Professor at Tilburg University

Key areas of expertise

  • Substantive interests

    • streaming business models (e.g., music, movies)
    • marketing-mix modeling and optimization
    • open science
  • Methodological interests

    • online data collection via APIs and web scraping
    • causal effects with observational data

Teaching activities

Getting to know you

  • What's your background - previous education (e.g., program)?
  • Any experience in Python (or other programming languages)?
  • What are your passions & talents? (+ why I am asking you this…)

Motivation for course

  • started out as a PhD student without data
  • was interested in music, and found a website with data (https://last.fm)
  • no best practices in scraping; learned everything by myself and made many mistakes
  • scraping is undervalued in the academic job market, but it plays a key role in shaping the relevance and rigor of your work
  • now scraping and APIs are a large part of what defines me…

Selection of scraping projects I've undertaken

What is scraping, and what are APIs?

With web scraping, you can capture anything you can view in a web browser

With APIs, you obtain official data from a firm in a programmatic way

  • e.g., as a developer, interact with Instagram, Twitter/X, ChatGPT / OpenAI, AWS, …
  • as a researcher, construct data set from analytics firms

Introducing music-to-scrape.org

  • Mock-up streaming service
  • Developed in the last few months, now testing with you guys
  • “Safe” and controlled environment to learn scraping and APIs

Screenshot of Music to scrape

Quick web scraper in Python (I)

  • Let's first import some packages
import requests
  • And then call a particular URL (check it out in your browser!)
url = 'https://music-to-scrape.org/'
webrequest = requests.get(url)

Quick web scraper in Python (II)

  • Finally, let's retrieve the weekly top 15 songs (we use HTML tags and attribute-value pairs for this)
from bs4 import BeautifulSoup
soup = BeautifulSoup(webrequest.text, 'html.parser')  # name the parser explicitly
weekly15 = soup.find('section', {'name':'weekly_15'})
for song in weekly15.find_all('h5'):
    print(song.text)
The Smashing Pumpkins
Robert Lockwood_ Jr.
Stevie Ray Vaughan And Double Trouble
Joi
Vangelis
Xzibit featuring Jayo Felony and Method Man
Tad
Brian Eno And David Byrne
Mongo Santamaria
Hatebreed
Chelsea
Liars
Lili Ivanova
Sonny Terry & Brownie McGhee
Snow Patrol
  • Works with any website, and with anything else you can view in a browser (e.g., apps)
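
To see what "HTML tags and attribute-value pairs" means without going online, here is a minimal sketch that uses only Python's built-in html.parser instead of BeautifulSoup. The HTML snippet, the class name SectionHeadingParser, and the artist names are made up for illustration; the real page is larger and may be structured differently.

```python
from html.parser import HTMLParser

# Made-up HTML for illustration; not copied from the real website.
SAMPLE_HTML = """
<html><body>
  <section name="weekly_15">
    <h5>Artist A</h5>
    <h5>Artist B</h5>
  </section>
  <section name="other">
    <h5>Not in the chart</h5>
  </section>
</body></html>
"""


class SectionHeadingParser(HTMLParser):
    """Collect the text of <h5> tags inside one named <section> (hypothetical helper)."""

    def __init__(self, section_name):
        super().__init__()
        self.section_name = section_name
        self.in_section = False
        self.in_h5 = False
        self.headings = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (attribute, value) pairs, e.g. [("name", "weekly_15")]
        if tag == "section" and dict(attrs).get("name") == self.section_name:
            self.in_section = True
        elif tag == "h5" and self.in_section:
            self.in_h5 = True
            self.headings.append("")

    def handle_endtag(self, tag):
        # Simplification: assumes sections are not nested
        if tag == "section":
            self.in_section = False
        elif tag == "h5":
            self.in_h5 = False

    def handle_data(self, data):
        # Only keep text that sits inside a matching <h5>
        if self.in_h5:
            self.headings[-1] += data


parser = SectionHeadingParser("weekly_15")
parser.feed(SAMPLE_HTML)
print(parser.headings)
```

The tag (section, h5) narrows down the type of element, and the attribute-value pair (name = weekly_15) picks the one section we care about. BeautifulSoup's find and find_all do the same job in far fewer lines, which is why we use it in class.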

Quick APIs in Python

  • APIs are official interfaces by firms for programmers to extract or submit data, or obtain access to an algorithm

  • They work like websites (i.e., you can call them with the same snippets as before), but usually you need to pay or at least sign up for the service

# let's get some data from the API of music-to-scrape
api_request = requests.get('https://api.music-to-scrape.org/charts/top-tracks')
  • Let's structure the output in JSON format
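
A minimal sketch of what structuring the output as JSON could look like, using Python's built-in json module. The response body below is a made-up sample, so the field names (chart, name, artist) are assumptions and the real API may return something different. With requests, calling api_request.json() performs the same parsing step on the actual response.

```python
import json

# Made-up response body for illustration; the real API's fields may differ.
sample_body = (
    '{"chart": ['
    '{"name": "Song A", "artist": "Artist A"}, '
    '{"name": "Song B", "artist": "Artist B"}'
    ']}'
)

# Turn the raw text into Python dictionaries and lists
data = json.loads(sample_body)

# Loop over the tracks and pick out individual fields
for track in data["chart"]:
    print(track["artist"], "-", track["name"])

# Pretty-print the whole structure to inspect how it is nested
print(json.dumps(data, indent=2))
```

Once the data is a dictionary, you can navigate it by key rather than searching through tags, which is a large part of why APIs are usually easier to work with than scraped pages.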